class: center, middle, inverse, title-slide .title[ # CANSIM and Data Visualization ] .subtitle[ ## University of Alberta - Cities Institute ] .author[ ### Stephen Rouse ] .date[ ### 2025-08-25 ] --- ## Quick Notes Slides & code are available from the link in the [chat](https://github.com/stephenr222/Teaching_Materials) for those wanting to follow along and copy-paste code on their own machine. The main coding packages we'll be working with in this workshop were developed by [Jens Vonn Bergmann](https://www.linkedin.com/in/vb-jens/), Canada's leading expert on housing data. - If you want to learn more after this course, his [blog](https://doodles.mountainmath.ca/) is a great place to start. --- ## Introduction .pull-left[  ] .pull-right[ My name's Stephen Rouse. I'm an Economist with lots of experience: - Researching national rental market trends at CMHC - Modelling CMHC's multibillion-dollar mortgage product - Developing the physician forecasting model for BC If you want to keep in touch (or send me hate-mail) after the course, feel free to connect with me on [LinkedIn](https://www.linkedin.com/in/stephenrouse2/). ] --- ## {cansim} and Data Visualization StatCan data is at the base of a lot of analysis in Canada. StatCan tables (formerly CANSIM) cover current and past timelines of Canadian socio-economic data. They update regularly, which means there is a high pay-off for reproducible and adaptable workflows and analysis. -- * **Reproducible**: can be repeated by others with minimal work, can be repeated as new data comes in (also auditable) -- * **Adaptable**: can be easily tweaked to accomplish related tasks -- * **Polished visualizations**: can be directly used for publication, with no (or minimal) extra work -- The [*R* programming language](https://www.r-project.org). Ideally using the [RStudio IDE](https://rstudio.com). We will be working with [RMarkdown](https://bookdown.org/yihui/rmarkdown/notebook.html) documents, and need the following packages: - [`cansim` package](https://mountainmath.github.io/cansim/) to access census data via the [StatCan NDM API](https://www.statcan.gc.ca/eng/developers/wds) - [tidyverse](https://www.tidyverse.org), an *opinionated collection of R packages* for intuitive general-purpose data manipulations and visualization capabilities. --- ## Agenda We will explore how to * access StatCan tables data * explore the datasets and perform basic data manipulations * perform basic descriptive analysis and visualization --- ## Getting set up For those wishing to follow along live-coding, open a new RMarkdown document. .pull-left[  ] .pull-right[  ] --- .pull-left[ ### RMarkdown RMarkdown allows mixing of code and text. Analysis, visualization and report are all one and the same document. ``` r library(tidyverse) data(mtcars) ggplot(mtcars,aes(x=mpg)) + geom_histogram(binwidth=4) ``` <!-- --> ] .pull-right[  RMarkdown documents can be compiled to **HTML**, **PDF** or **Word**. The output of code blocks gets inserted into the document. We can show or hide the actual code. These slides are entirely done in RMarkdown. ] --- # cansim .pull-left[ The [`cansim` R package](https://mountainmath.github.io/cansim/) interfaces with the StatCan NDM that replaces the former CANSIM tables. It can be queried for - whole tables - specific vectors - data discovery searching through tables It encodes the metadata and allows to work with the internal hierarchical structure of the fields. ``` r #install.packages("tidyverse") library(tidyverse) #install.packages("cansim") library(cansim) ``` ] .pull-right[ <img src="https://raw.githubusercontent.com/mountainMath/cansim/master/images/cansim-sticker.png" alt="cansim" style="height:500px;margin-top:-80px;"> ] --- class: medium-code ## Finding a dataset to use Right now we have two ways: * Google * Manually searching through StatCan's website CANSIM has a function that helps us cut through all this noise and search for exactly what we want. .pull-bottom[ .small-font[ ``` r search_cansim_tables("motor vehicle sales") %>% select(cansim_table_number,title) %>% knitr::kable() ``` |cansim_table_number |title | |:-------------------|:-----------------------------------------------------| |20-10-0001 |New motor vehicle sales, inactive | |20-10-0002 |New motor vehicle sales, by type of vehicle, inactive | |20-10-0085 |New motor vehicle sales, monthly | |20-10-0086 |New motor vehicle sales, by vehicle type, annual | ]] --- .pull-top[  ] -- StatCan is working on better data discovery, hopefully things will get easier in the future. --- ## First example: Motor vehicle sales To start off we grab data on motor vehicle sales from table 20-10-0001 and inspect the available variables. ``` r mv_sales <- get_cansim("20-10-0001") %>% normalize_cansim_values(factors=TRUE) mv_sales %>% select_if(is.factor) %>% lapply(levels) ``` ``` ## $GEO ## [1] "Newfoundland and Labrador" ## [2] "Prince Edward Island" ## [3] "Nova Scotia" ## [4] "New Brunswick" ## [5] "Quebec" ## [6] "Ontario" ## [7] "Manitoba" ## [8] "Saskatchewan" ## [9] "Alberta" ## [10] "British Columbia and the Territories" ## [11] "Canada" ## ## $`Vehicle type` ## [1] "Passenger cars" "Trucks" ## [3] "Total, new motor vehicles" ## ## $`Origin of manufacture` ## [1] "North America" "Total, overseas" ## [3] "Japan" "Other countries" ## [5] "Total, country of manufacture" ## ## $Sales ## [1] "Units" "Dollars" ## ## $`Seasonal adjustment` ## [1] "Unadjusted" "Seasonally adjusted" ``` --- ## Motor vehicle sales ``` r plot_data <- mv_sales %>% filter(GEO=="Canada", Date>=as.Date("1990-01-01"), `Vehicle type`!="Total, new motor vehicles", `Origin of manufacture`=="Total, country of manufacture", Sales=="Units", `Seasonal adjustment`=="Unadjusted") ggplot(plot_data,aes(x=Date,y=VALUE,color=`Vehicle type`)) + geom_line() ``` <!-- --> --- ## Motor vehicle sales (nicer graph) ``` r g <- ggplot(plot_data, aes(x=Date,y=VALUE,color=`Vehicle type`)) + theme_light() + geom_line(alpha=0.2) + geom_smooth(span=0.1) + scale_y_continuous(labels=function(d)scales::comma(d,scale=1/1000,suffix="k")) + labs(title="Canadian new motor vehicle sales",x=NULL,y="Sales per month", caption="StatCan Table 20-10-0001") g ``` <!-- --> --- class: medium-code ## Motor vehicle sales (annotated graph, final version) ``` r library(ggrepel) # (for nicer labels) g + geom_text_repel(data=~filter(.,Date==as.Date("1990-08-01"),`Vehicle type`=="Passenger cars"), label="Passenger cars",hjust=0,nudge_y = 30000) + geom_text_repel(data=~filter(.,Date==as.Date("2016-11-01"),`Vehicle type`=="Trucks"), label="Trucks, SUVs, Vans, Buses", hjust=1,nudge_x = -2000,nudge_y=10000) + scale_color_manual(values=c("Passenger cars"="steelblue","Trucks"="brown"),guide=FALSE) ``` <!-- --> --- class: medium-code ## Motor vehicle sales (seasonality) ``` r plot_data %>% rename(type=`Vehicle type`) %>% ggseas::ggsdc(aes(x=Date,y=VALUE,color=type),method="stl",frequency = 12,s.window = 24) + theme_light() + geom_line() + scale_color_manual(values=c("Passenger cars"="steelblue","Trucks"="brown"),guide=FALSE) + scale_y_continuous(labels=function(d)scales::comma(d,scale=1/1000,suffix="k")) + labs(title="Canadian new motor vehicle sales (STL decomposition)",x=NULL,y="Sales per month", colour = "Vehicle type",caption="StatCan Table 20-10-0001") ``` <!-- --> --- ## Cigarette sales .pull-left[ Sometimes we are just interested in one specific variable. It can be easier to pull in the StatCan vector. Vector discovery can be cumbersome, downloaded table data has it. The web view has it too. We can go to the web view of table right from the R console using the following command. ``` r view_cansim_webpage("16-10-0044") ``` Selecting the **Add/Remove data** option allows us to filter the data down to what we want and enable the display of StatCan vectors. ] .pull-right[  ] --- class: medium-code ## Cigarette sales ``` r g <- get_cansim_vector("v28536414","1800-01-01") %>% normalize_cansim_values() %>% ggplot(aes(x=Date,y=VALUE)) + geom_line() + geom_smooth(span=0.25,se=FALSE) + scale_y_continuous(labels=function(d)scales::comma(d,scale = 10E-9,suffix="bn")) + labs(title="Canadian cigarette sales",x=NULL,y="Monthly sales",caption="StatCan vector v28536414") g ``` <!-- --> --- class: medium-code ## Cigarette sales (adding context) ``` r g + geom_vline(xintercept = as.Date(c("2005-03-18","2005-08-22","2006-05-31","2006-12-01", "2007-01-01","2007-06-17","2010-07-05","2012-02-22")), linetype="dashed", color="brown") + geom_label(x=as.Date("2010-01-01"),y=3.3E9,label="Partial smoking bans\nand advertising limitations", hjust=0.5,color="brown") + geom_label(x=as.Date("2020-08-01"),y=2.5E9, label="No visible impact\nof COVID-19", hjust=0.8) ``` <!-- --> --- class: medium-code ## Searching for a New Table .medium-font[ ``` r search_cansim_tables("job vacancy") %>% select(cansim_table_number,title) %>% knitr::kable() ``` |cansim_table_number |title | |:-------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------| |13-10-0908 |Oral health care providers, job vacancies, jobs occupied, and job vacancy rate by provider type | |14-10-0001 |Average full-time hourly wage paid and payroll employment by type of work, economic region and occupation | |14-10-0080 |Average lowest and average highest full-time hourly wages paid at business locations by occupation | |14-10-0103 |Average full-time hourly wage paid and payroll employment by type of work, industry and occupation | |14-10-0224 |Job vacancies, labour demand and job vacancy rate, three-month moving average, unadjusted for seasonality, inactive | |14-10-0225 |Job vacancies, labour demand and job vacancy rate, annual, inactive | |14-10-0226 |Job vacancies, unemployed and unemployment-to-job vacancies ratio, three-month moving average, unadjusted for seasonality, inactive | |14-10-0227 |Job vacancies, unemployed and unemployment-to-job vacancies ratio, annual, inactive | |14-10-0325 |Job vacancies, payroll employees, job vacancy rate, and average offered hourly wage by provinces and territories, quarterly, unadjusted for seasonality, inactive | |14-10-0326 |Job vacancies, payroll employees, job vacancy rate, and average offered hourly wage by industry sector, quarterly, unadjusted for seasonality, inactive | |14-10-0328 |Job vacancies, proportion of job vacancies and average offered hourly wage by selected characteristics, quarterly, unadjusted for seasonality, inactive | |14-10-0356 |Job vacancies and average offered hourly wage by occupation (broad occupational category), quarterly, unadjusted for seasonality, inactive | |14-10-0371 |Job vacancies, payroll employees, and job vacancy rate by provinces and territories, monthly, unadjusted for seasonality | |14-10-0372 |Job vacancies, payroll employees, and job vacancy rate by industry sector, monthly, unadjusted for seasonality | |14-10-0398 |Job vacancies, payroll employees, and job vacancy rate, by economic regions, quarterly, adjusted for seasonality | |14-10-0399 |Job vacancies by occupation (broad occupational category), by type of work and position, quarterly, adjusted for seasonality | |14-10-0400 |Job vacancies, payroll employees, and job vacancy rate by industry sector, quarterly, adjusted for seasonality | |14-10-0406 |Job vacancies, payroll employees, and job vacancy rate by industry sector, monthly, adjusted for seasonality | |14-10-0432 |Job vacancies, payroll employees, and job vacancy rate by provinces and territories, monthly, adjusted for seasonality | |14-10-0441 |Job vacancies, payroll employees, job vacancy rate, and average offered hourly wage by economic regions, quarterly, unadjusted for seasonality | |14-10-0442 |Job vacancies, payroll employees, job vacancy rate, and average offered hourly wage by industry sub-sector, quarterly, unadjusted for seasonality | |14-10-0443 |Job vacancies, proportion of job vacancies and average offered hourly wage by occupation and selected characteristics, quarterly, unadjusted for seasonality | |14-10-0444 |Job vacancies and average offered hourly wage by occupation (unit group), quarterly, unadjusted for seasonality | ] --- class: medium-code ## Combining StatCan tables To understand growth of jobs we combined LFS employment data with JVWS data on job vacancies. ``` r library(lubridate) lfs_data <- get_cansim("14-10-0293") %>% normalize_cansim_values() %>% filter(`Labour force characteristics`=="Employment", Statistics=="Estimate") jv_data <- get_cansim("14-10-0325") %>% normalize_cansim_values() %>% mutate(Date=Date %m+% months(1)) %>% bind_rows(get_cansim("14-10-0371") %>% normalize_cansim_values()) %>% filter(Statistics=="Job vacancies") ``` Job vacancies used to only be available quarterly, so take old job vacancy data time-shifted to the middle of the quarter and add on the new monthly data. .medium-font[ ``` r jv_data %>% filter(GeoUID=="59") %>% select(Date,GeoUID,`Job vacancies`=VALUE) %>% tail ``` ``` ## # A tibble: 6 × 3 ## Date GeoUID `Job vacancies` ## <date> <chr> <dbl> ## 1 2024-12-01 59 66835 ## 2 2025-01-01 59 83780 ## 3 2025-02-01 59 72660 ## 4 2025-03-01 59 84275 ## 5 2025-04-01 59 86405 ## 6 2025-05-01 59 83195 ``` ] --- class: medium-code ## Joining the two data series The time series are collected with different frequencies, we want to only look at series where they match. An *inner join* will do that for us. We are also cutting down to four provinces. ``` r joined_jobs_data <- inner_join(lfs_data %>% select(Date,GeoUID,GEO,Employment=VALUE), jv_data %>% select(Date,GeoUID,`Job vacancies`=VALUE), by=c("Date","GeoUID")) %>% filter(GEO %in% c("Alberta","British Columbia","Ontario","Quebec")) tail(joined_jobs_data) ``` ``` ## # A tibble: 6 × 5 ## Date GeoUID GEO Employment `Job vacancies` ## <date> <chr> <fct> <dbl> <dbl> ## 1 2020-11-01 59 British Columbia 2479. 93835 ## 2 2020-11-01 59 British Columbia 2479. 91755 ## 3 2020-12-01 24 Quebec 4260. 128740 ## 4 2020-12-01 35 Ontario 7299. 183700 ## 5 2020-12-01 48 Alberta 2222. 35415 ## 6 2020-12-01 59 British Columbia 2494. 79100 ``` We now have one table with both, filled and vacant jobs, for all regions and dates for which we have data on both. --- class: medium-code ## Reshaping the data This time we are going to reshape the data from *wide* to *long* format. This is what makes it easiest to plot. ``` r jobs_data <- joined_jobs_data %>% pivot_longer(c("Job vacancies","Employment"), names_to="Type",values_to="Jobs") %>% mutate(Type=factor(Type,levels=c("Job vacancies","Employment"))) tail(jobs_data) ``` ``` ## # A tibble: 6 × 5 ## Date GeoUID GEO Type Jobs ## <date> <chr> <fct> <fct> <dbl> ## 1 2020-12-01 35 Ontario Job vacancies 183700 ## 2 2020-12-01 35 Ontario Employment 7299. ## 3 2020-12-01 48 Alberta Job vacancies 35415 ## 4 2020-12-01 48 Alberta Employment 2222. ## 5 2020-12-01 59 British Columbia Job vacancies 79100 ## 6 2020-12-01 59 British Columbia Employment 2494. ``` We can now easily do a stacked bar chart plotting jobs over time for each region, stacking the filled and vacant jobs. --- class: medium-code ## Jobs (filled and vacant) ``` r ggplot(jobs_data,aes(x=Date,y=Jobs,fill=Type)) + geom_bar(stat="identity") + facet_wrap("GEO",scales="free_y",nrow=1) + scale_y_continuous(labels=scales::comma) + labs(title="Jobs by economic region",fill=NULL,x=NULL,y=NULL, caption="StatCan Tables 14-10-0293, 14-10-0325, 14-10-0371") ``` <!-- --> --- class: medium-code ## Income by age groups ``` r income_age_groups <- c("16 to 24 years", "25 to 34 years" , "35 to 44 years" , "45 to 54 years" , "55 to 64 years", "65 years and over") income_data <- get_cansim("11-10-0239") %>% normalize_cansim_values(factors = TRUE) %>% filter(GEO=="Canada", Sex=="Both sexes", Statistics=="Median income (excluding zeros)", `Income source`=="Total income", `Age group` %in% income_age_groups) ``` Sometimes we want to do several similar plots, it can be useful to define a custom theme. ``` r line_theme <- list( geom_line(), geom_point(data=~filter(.,Date==max(Date))), scale_color_brewer(palette="Dark2",guide=FALSE), theme_light(), expand_limits(x=as.Date("2025-01-01")), ggrepel::geom_text_repel(data=~filter(.,Date==max(Date)),hjust=-0.1, color='black',direction="y",size=3), scale_y_continuous(labels=scales::dollar) ) ``` --- class: medium-code ## Income by age groups ``` r ggplot(income_data,aes(x=Date,y=VALUE,color=`Age group`,label=`Age group`)) + line_theme + labs(title="Median income by age group in Canada", x=NULL, y=unique(income_data$UOM), caption="StatCan Table 11-10-0239") ``` <!-- --> --- class: medium-code ## Wealth ``` r wealth_age_groups <- c("Under 35 years", "35 to 44 years" , "45 to 54 years", "55 to 64 years" , "65 years and older") wealth_data <- get_cansim("11-10-0016") %>% normalize_cansim_values(factors=TRUE) %>% filter(GEO=="Canada", `Assets and debts`=="Net worth (total assets less total debt)", Statistics=="Median value for those holding asset or debt", `Economic family type`!="Economic families and persons not in an economic family", `Age group` %in% wealth_age_groups) %>% select(GEO,Date,`Age group`,`Confidence intervals`,`Economic family type`,UOM,VALUE) %>% pivot_wider(names_from="Confidence intervals",values_from="VALUE") ``` Wealth data needs a bit more processing. The SFS is not that deep and confidence intervals can be large, so we want to pay attention to that. Also, the economic family type really matters, so we want to break that out. ``` r head(wealth_data) ``` ``` ## # A tibble: 6 × 8 ## GEO Date `Age group` `Economic family type` UOM Estimate ## <fct> <date> <fct> <fct> <chr> <dbl> ## 1 Canada 1999-01-01 Under 35 years Economic families 2023… 72200 ## 2 Canada 1999-01-01 Under 35 years Persons not in an economic fa… 2023… 8500 ## 3 Canada 1999-01-01 35 to 44 years Economic families 2023… 201100 ## 4 Canada 1999-01-01 35 to 44 years Persons not in an economic fa… 2023… 51300 ## 5 Canada 1999-01-01 45 to 54 years Economic families 2023… 391000 ## 6 Canada 1999-01-01 45 to 54 years Persons not in an economic fa… 2023… 89000 ## # ℹ 2 more variables: `Lower bound of a 95% confidence interval` <dbl>, ## # `Upper bound of a 95% confidence interval` <dbl> ``` --- class: medium-code ## Wealth ``` r ggplot(wealth_data,aes(x=Date,y=Estimate,color=`Age group`,label=`Age group`)) + geom_ribbon(aes(ymin=`Lower bound of a 95% confidence interval`, ymax=`Upper bound of a 95% confidence interval`),fill="grey",alpha=0.3,size=0) + line_theme + facet_wrap("`Economic family type`") + labs(title="Median net worth by age group in Canada",x=NULL,y=unique(wealth_data$UOM), caption="StatCan Table 11-10-0239") ``` <!-- --> --- ## Recap * APIs, like StatCan NDM, make it easy to pull in data as needed. * scripting data processing in R (or other scripting languages) makes analysis transparent, reproducible, and adaptable. * simply re-run the scripts when new data becomes available. * to collaborate just share the code, don't need to worry about sharing data and keeping data up-to-date. * iterative process: can easily add data analysis and visualization. * packages like **cansim** provides stability against API changes, they abstract the changes under the hood and deliver (more) stable results, and offer higher-level processing functionality. -- * Still need to perform basic data processing steps. Common steps are: - filter() -- filter the data down to what you need - select() -- remove clutter and select the columns you need - group_by() %>% summarize() -- group and summarize data - ..._join() -- join two data series along common attributes (left_join, inner_join, ...) - pivot_wider() -- convert from long form to wide form - pivot_longer() -- convert from wide form to long form --- class: inverse center ## Thanks for listening and coding along! The RMarkdown file for this presentation is available on my github (linked in the chat), in case anyone wants to download the code and adapt it for their own purposes. <div style="height:10%;"></div> <hr> The [official documentation](https://mountainmath.github.io/cansim/index.html) has more references and examples for the {cansim} package. For more complex analysis and code using cansim, you can check out Jens' [blog](https://doodles.mountainmath.ca/)